1.
Drug Discov Today ; 29(6): 103990, 2024 Apr 23.
Article in English | MEDLINE | ID: mdl-38663581

ABSTRACT

The enormous growth in the amount of data generated by the life sciences is continuously shifting the field from model-driven science towards data-driven science. The need for efficient processing has led to the adoption of massively parallel accelerators such as graphics processing units (GPUs). Consequently, the development of bioinformatics methods nowadays often heavily depends on the effective use of these powerful technologies. Furthermore, progress in computational techniques and architectures continues to be highly dynamic, involving novel deep neural network models and artificial intelligence (AI) accelerators, and potentially quantum processing units in the future. These are expected to be disruptive for the life sciences as a whole and for drug discovery in particular. Here, we identify three waves of acceleration and their applications in a bioinformatics context: (i) GPU computing, (ii) AI and (iii) next-generation quantum computers.

2.
NAR Genom Bioinform ; 5(3): lqad082, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37705831

ABSTRACT

Deep learning has emerged as a paradigm that revolutionizes numerous domains of scientific research. Transformers have been utilized in language modeling outperforming previous approaches. Therefore, the utilization of deep learning as a tool for analyzing the genomic sequences is promising, yielding convincing results in fields such as motif identification and variant calling. DeepMicrobes, a machine learning-based classifier, has recently been introduced for taxonomic prediction at species and genus level. However, it relies on complex models based on bidirectional long short-term memory cells resulting in slow runtimes and excessive memory requirements, hampering its effective usability. We present MetaTransformer, a self-attention-based deep learning metagenomic analysis tool. Our transformer-encoder-based models enable efficient parallelization while outperforming DeepMicrobes in terms of species and genus classification abilities. Furthermore, we investigate approaches to reduce memory consumption and boost performance using different embedding schemes. As a result, we are able to achieve 2× to 5× speedup for inference compared to DeepMicrobes while keeping a significantly smaller memory footprint. MetaTransformer can be trained in 9 hours for genus and 16 hours for species prediction. Our results demonstrate performance improvements due to self-attention models and the impact of embedding schemes in deep learning on metagenomic sequencing data.

3.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37540201

ABSTRACT

MOTIVATION: Including ion mobility separation (IMS) into mass spectrometry proteomics experiments is useful to improve coverage and throughput. Many IMS devices enable linking experimentally derived mobility of an ion to its collisional cross-section (CCS), a highly reproducible physicochemical property dependent on the ion's mass, charge and conformation in the gas phase. Thus, known peptide ion mobilities can be used to tailor acquisition methods or to refine database search results. The large space of potential peptide sequences, driven also by posttranslational modifications of amino acids, motivates an in silico predictor for peptide CCS. Recent studies explored the general performance of various machine-learning techniques; however, workflow engineering was of secondary importance. For the sake of applicability, such a tool should be generic, data driven, and offer the possibility to be easily adapted to individual workflows for experimental design and data processing. RESULTS: We created ionmob, a Python-based framework for data preparation, training, and prediction of collisional cross-section values of peptides. It is easily customizable and includes a set of pretrained, ready-to-use models and preprocessing routines for training and inference. Using a set of ≈21 000 unique phosphorylated peptides and ≈17 000 MHC ligand sequences and charge state pairs, we expand upon the space of peptides that can be integrated into CCS prediction. Lastly, we investigate the applicability of in silico predicted CCS to increase confidence in identified peptides by applying methods of re-scoring and demonstrate that predicted CCS values complement existing predictors for that task. AVAILABILITY AND IMPLEMENTATION: The Python package is available on GitHub: https://github.com/theGreatHerrLebert/ionmob.


Subject(s)
Machine Learning , Peptides , Peptides/chemistry , Mass Spectrometry/methods , Amino Acid Sequence , Proteomics/methods , Ions
4.
Int J Mol Sci ; 24(12)2023 Jun 16.
Article in English | MEDLINE | ID: mdl-37373385

ABSTRACT

Cancer therapy with clinically established anticancer drugs is frequently hampered by the development of drug resistance of tumors and severe side effects in normal organs and tissues. The demand for powerful, but less toxic, drugs is high. Phytochemicals represent an important reservoir for drug development and frequently exert less toxicity than synthetic drugs. Bioinformatics can accelerate and simplify the highly complex, time-consuming, and expensive drug development process. Here, we analyzed 375 phytochemicals using virtual screenings, molecular docking, and in silico toxicity predictions. Based on these in silico studies, six candidate compounds were further investigated in vitro. Resazurin assays were performed to determine the growth-inhibitory effects towards wild-type CCRF-CEM leukemia cells and their multidrug-resistant, P-glycoprotein (P-gp)-overexpressing subline, CEM/ADR5000. Flow cytometry was used to measure the potential to inhibit P-gp-mediated doxorubicin transport. Bidwillon A, neobavaisoflavone, coptisine, and z-guggulsterone all showed growth-inhibitory effects and moderate P-gp inhibition, whereas miltirone and chamazulene strongly inhibited tumor cell growth and strongly increased intracellular doxorubicin uptake. Bidwillon A and miltirone were selected for molecular docking to wildtype and mutated P-gp forms in closed and open conformations. The P-gp homology models harbored clinically relevant mutations, i.e., six single missense mutations (F336Y, A718C, Q725A, F728A, M949C, Y953C), three double mutations (Y310A-F728A; F343C-V982C; Y953A-F978A), or one quadruple mutation (Y307C-F728A-Y953A-F978A). The mutants did not show major differences in binding energies compared to wildtypes. Closed P-gp forms generally showed higher binding affinities than open ones. Closed conformations might stabilize the binding, thereby leading to higher binding affinities, while open conformations may favor the release of compounds into the extracellular space. In conclusion, this study described the capability of selected phytochemicals to overcome multidrug resistance.


Subject(s)
Drug Resistance, Neoplasm , Neoplasms , Humans , Molecular Docking Simulation , Doxorubicin/pharmacology , Phytochemicals/pharmacology , ATP Binding Cassette Transporter, Subfamily B/genetics , ATP Binding Cassette Transporter, Subfamily B/metabolism , Cell Line, Tumor
5.
Molecules ; 28(3)2023 Jan 18.
Article in English | MEDLINE | ID: mdl-36770656

ABSTRACT

During the past three decades, humans have been confronted with different new coronavirus outbreaks. Since the end of the year 2019, COVID-19 has threatened the world as a rapidly spreading infectious disease. For this work, we targeted the non-structural protein 16 (nsp16) as a key protein of SARS-CoV-2, SARS-CoV-1 and MERS-CoV to develop broad-spectrum inhibitors of nsp16. Computational methods were used to filter candidates from a natural product-based library of 224,205 compounds obtained from the ZINC database. The binding of the candidates to nsp16 was assessed using virtual screening with VINA LC, and molecular docking with AutoDock 4.2.6. The top 9 compounds were bound to the nsp16 protein of SARS-CoV-2, SARS-CoV-1, and MERS-CoV with the lowest binding energies (LBEs) in the range of -9.0 to -13.0 kcal/mol with VINA LC. The AutoDock-based LBEs for nsp16 of SARS-CoV-2 ranged from -11.42 to -16.11 kcal/mol with predicted inhibition constants (pKi) from 0.002 to 4.51 nM; the natural substrate S-adenosyl methionine (SAM) was used as a control. In silico results were verified by microscale thermophoresis as an in vitro assay. The candidates were investigated further for their cytotoxicity in normal MRC-5 lung fibroblasts to determine their therapeutic indices. Here, the IC50 values of all three compounds were >10 µM. In summary, we identified three novel SARS-CoV-2 inhibitors, two of which showed broad-spectrum activity to nsp16 in SARS-CoV-2, SARS-CoV-1, and MERS-CoV. All three compounds are coumarin derivatives that contain chromen-2-one in their scaffolds.
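
The link between a docking binding energy and a predicted inhibition constant can be illustrated with the standard thermodynamic relation Ki = exp(ΔG / (R·T)). The following is a minimal sketch, not the authors' code: the function name is hypothetical, and the temperature of 298.15 K is an assumption, since the abstract does not state which value was used.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL_PER_MOL_K = 1.98720e-3


def predicted_ki_nanomolar(delta_g_kcal_per_mol, temperature_k=298.15):
    """Convert a binding free energy (kcal/mol) into an inhibition constant (nM).

    Uses Ki = exp(dG / (R*T)); the default temperature is an assumption.
    """
    ki_molar = math.exp(delta_g_kcal_per_mol / (R_KCAL_PER_MOL_K * temperature_k))
    return ki_molar * 1e9  # mol/L -> nmol/L


# The two binding-energy bounds quoted in the abstract.
for energy in (-11.42, -16.11):
    print(f"{energy} kcal/mol -> {predicted_ki_nanomolar(energy):.4f} nM")
```

Under these assumptions, the two energy bounds map to constants in the low-nanomolar and low-picomolar range, the same order of magnitude as the range quoted in the abstract. A more negative binding energy always yields a smaller predicted constant.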


Subject(s)
COVID-19 , Middle East Respiratory Syndrome Coronavirus , Humans , SARS-CoV-2 , Molecular Docking Simulation , S-Adenosylmethionine
6.
Pharmaceuticals (Basel) ; 15(9)2022 Aug 24.
Article in English | MEDLINE | ID: mdl-36145267

ABSTRACT

The nucleocapsid protein (NP) is one of the main proteins out of four structural proteins of coronaviruses including the severe acute respiratory syndrome coronavirus 2, SARS-CoV-2, discovered in 2019. NP packages the viral RNA during virus assembly and is, therefore, indispensable for virus reproduction. NP consists of two domains, i.e., the N- and C-terminal domains. RNA-binding is mainly performed by a binding pocket within the N-terminal domain (NTD). NP represents an important target for drug discovery to treat COVID-19. In this project, we used the Vina LC virtual drug screening software and a ZINC-based database with 210,541 natural and naturally derived compounds that specifically target the binding pocket of NTD of NP. Our aim was to identify coronaviral inhibitors that target NP not only of SARS-CoV-2 but also of other diverse human pathogenic coronaviruses. Virtual drug screening and molecular docking procedures resulted in 73 candidate compounds with a binding affinity below -9 kcal/mol with NP NTD of SARS-CoV-1, SARS-CoV-2, MERS-CoV, HCoV-OC43, HCoV-NL63, HCoV-229E, and HCoV-HKU1. The top five compounds that met the applied drug-likeness criteria were then tested for their binding in vitro to the NTD of the full-length recombinant NP proteins using microscale thermophoresis. Compounds (1), (2), and (4), which belong to the same scaffold family of 4-oxo-substituted-6-[2-(4a-hydroxy-decahydroisoquinolin-2-yl)2H-chromen-2-ones and which are derivatives of coumarin, were bound with good affinity to NP. Compounds (1) and (4) were bound to the full-length NP of SARS-CoV-2 (aa 1-419) with Kd values of 0.798 (±0.02) µM and 8.07 (±0.36) µM, respectively. Then, these coumarin derivatives were tested with the SARS-CoV-2 NP NTD (aa 48-174). Compounds (1) and (4) revealed Kd values of 0.95 (±0.32) µM and 7.77 (±6.39) µM, respectively. Compounds (1) and (4) caused low toxicity in human A549 and MRC-5 cell lines. These compounds may represent possible drug candidates, which need further optimization to be used against COVID-19 and other coronaviral infections.

7.
BMC Bioinformatics ; 23(1): 287, 2022 Jul 20.
Article in English | MEDLINE | ID: mdl-35858828

ABSTRACT

BACKGROUND: Mass spectrometry is an important experimental technique in the field of proteomics. However, analysis of certain mass spectrometry data faces a combination of two challenges: first, even a single experiment produces a large amount of multi-dimensional raw data and, second, signals of interest are not single peaks but patterns of peaks that span along the different dimensions. The rapidly growing amount of mass spectrometry data increases the demand for scalable solutions. Furthermore, existing approaches for signal detection usually rely on strong assumptions concerning the signals' properties. RESULTS: In this study, it is shown that locality-sensitive hashing enables signal classification in mass spectrometry raw data at scale. Through appropriate choice of algorithm parameters it is possible to balance false-positive and false-negative rates. On synthetic data, a superior performance compared to an intensity thresholding approach was achieved. Real data could be strongly reduced without losing relevant information. Our implementation scaled out up to 32 threads and supports acceleration by GPUs. CONCLUSIONS: Locality-sensitive hashing is a desirable approach for signal classification in mass spectrometry raw data. AVAILABILITY: Generated data and code are available at https://github.com/hildebrandtlab/mzBucket . Raw data is available at https://zenodo.org/record/5036526 .
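
To make the idea of locality-sensitive hashing concrete, the sketch below shows one generic variant, random-hyperplane hashing, in which similar vectors tend to fall into the same bucket. It is an illustration only and is not taken from mzBucket: the function names, the choice of hash family, and the representation of a signal window as a plain list of floats are all assumptions.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector, hyperplanes):
    """Return one sign bit per hyperplane; vectors at a small angle share most bits."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def bucket(vectors, hyperplanes):
    """Group vector indices by their hash key."""
    buckets = {}
    for index, vector in enumerate(vectors):
        buckets.setdefault(lsh_key(vector, hyperplanes), []).append(index)
    return buckets


planes = make_hyperplanes(num_bits=8, dim=4)
window = [1.0, 2.0, 3.0, 4.0]
scaled = [2.0 * x for x in window]  # same shape, different intensity
print(bucket([window, scaled], planes))
```

In this variant the number of hash bits is the tuning knob: more bits make collisions between dissimilar vectors rarer but also split similar vectors more often, which is one way a trade-off between false positives and false negatives can arise.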


Subject(s)
Algorithms , Software , Mass Spectrometry , Proteomics/methods
8.
Bioinformatics ; 37(7): 889-895, 2021 05 17.
Article in English | MEDLINE | ID: mdl-32818262

ABSTRACT

MOTIVATION: Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes. RESULTS: We present CARE, an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration. AVAILABILITY AND IMPLEMENTATION: CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
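
The minhashing concept mentioned above can be sketched in a few lines: each read is reduced to a short signature of minimum k-mer hash values, and the fraction of matching signature positions estimates the k-mer overlap between two reads. This is a generic textbook version, not CARE's implementation; the function names, the default parameters, and the use of salted BLAKE2b as the hash family are assumptions made for illustration.

```python
import hashlib


def kmers(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash; the seed selects one member of the hash family."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(read, k=8, num_hashes=32):
    """Keep the smallest hash value per hash function (assumes len(read) >= k)."""
    kmer_set = kmers(read, k)
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the k-mer Jaccard index."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


read_a = "ACGTACGTTGCAAGCTTAGGCTAACGT"
read_b = "ACGTACGTTGCAAGCTTAGGCTAACGA"  # differs only in the last base
print(estimated_similarity(minhash_signature(read_a), minhash_signature(read_b)))
```

Because signatures are short and fixed in length, reads with equal signature entries can be looked up in hash tables instead of being compared pairwise, which is what makes this kind of similarity search practical for large read collections.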


Subject(s)
High-Throughput Nucleotide Sequencing , Software , Algorithms , Humans , Sequence Alignment , Sequence Analysis, DNA
9.
Drug Discov Today ; 26(1): 173-180, 2021 01.
Article in English | MEDLINE | ID: mdl-33059075

ABSTRACT

Next-generation sequencing (NGS) methods lie at the heart of large parts of biological and medical research. Their fundamental importance has created a continuously increasing demand for processing and analysis methods of the data sets produced, addressing questions such as variant calling, metagenomic classification and quantification, genomic feature detection, or downstream analysis in larger biological or medical contexts. In addition to classical algorithmic approaches, machine-learning (ML) techniques are often used for such tasks. In particular, deep learning (DL) methods that use multilayered artificial neural networks (ANNs) for supervised, semisupervised, and unsupervised learning have gained significant traction for such applications. Here, we highlight important network architectures, application areas, and DL frameworks in a NGS context.


Subject(s)
Deep Learning , High-Throughput Nucleotide Sequencing/methods , Metagenomics , Neural Networks, Computer , Biomedical Research/trends , Humans , Metagenomics/methods , Metagenomics/trends
10.
Nucleic Acids Res ; 49(4): e23, 2021 02 26.
Article in English | MEDLINE | ID: mdl-33313868

ABSTRACT

Methods for the detection of m6A by RNA-Seq technologies are increasingly sought after. We here present NOseq, a method to detect m6A residues in defined amplicons by virtue of their resistance to chemical deamination, effected by nitrous acid. Partial deamination in NOseq affects all exocyclic amino groups present in nucleobases and thus also changes sequence information. The method uses a mapping algorithm specifically adapted to the sequence degeneration caused by deamination events. Thus, m6A sites with partial modification levels of ∼50% were detected in defined amplicons, and this threshold can be lowered to ∼10% by combination with m6A immunoprecipitation. NOseq faithfully detected known m6A sites in human rRNA, and the long non-coding RNA MALAT1, and positively validated several m6A candidate sites, drawn from miCLIP data with an m6A antibody, in the transcriptome of Drosophila melanogaster. Conceptually related to bisulfite sequencing, NOseq presents a novel amplicon-based sequencing approach for the validation of m6A sites in defined sequences.


Subject(s)
Adenosine/analogs & derivatives , High-Throughput Nucleotide Sequencing/methods , RNA/chemistry , Sequence Analysis, RNA/methods , Adenosine/analysis , Algorithms , Animals , Chromatography, Liquid , Deamination , Drosophila melanogaster/genetics , HEK293 Cells , HeLa Cells , Humans , RNA, Long Noncoding/chemistry , RNA, Messenger/chemistry , RNA, Ribosomal, 18S/chemistry , Sequence Alignment , Tandem Mass Spectrometry
11.
PLoS One ; 15(12): e0243295, 2020.
Article in English | MEDLINE | ID: mdl-33270795

ABSTRACT

Metrology has been successfully used in the last decade to quantify use-wear on stone tools. Such techniques have been mostly applied to fine-grained rocks (chert), while studies on coarse-grained raw materials have been relatively infrequent. In this study, confocal microscopy was employed to investigate polished surfaces on a coarse-grained lithology, quartzite. Wear originating from contact with five different worked materials was classified in a data-driven approach using machine learning. Two different classifiers, a decision tree and a support-vector machine, were used to assign the different textures to a worked material based on a selected number of parameters (Mean density of furrows, Mean depth of furrows, Core material volume-Vmc). The method proved successful, presenting high scores for bone and hide (100%). The obtained classification rates are satisfactory for the other worked materials, with the only exception of cane, which shows overlaps with other materials. Although the results presented here are preliminary, they can be used to develop future studies on quartzite including enlarged sample sizes.


Subject(s)
Quartz/chemistry , Quartz/classification
12.
BMC Bioinformatics ; 21(1): 102, 2020 Mar 12.
Article in English | MEDLINE | ID: mdl-32164527

ABSTRACT

BACKGROUND: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. RESULTS: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). CONCLUSIONS: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).


Subject(s)
Big Data , Food Analysis/methods , High-Throughput Nucleotide Sequencing/methods , Metagenomics/methods , Whole Genome Sequencing/methods , Biosurveillance , Genome, Bacterial , Metagenome , Microbiota/genetics , Software
13.
Nucleic Acids Res ; 48(7): 3734-3746, 2020 04 17.
Article in English | MEDLINE | ID: mdl-32095818

ABSTRACT

Reverse transcription (RT) of RNA templates containing RNA modifications leads to synthesis of cDNA containing information on the modification in the form of misincorporation, arrest, or nucleotide skipping events. A compilation of such events from multiple cDNAs represents an RT-signature that is typical for a given modification, but, as we show here, depends also on the reverse transcriptase enzyme. A comparison of 13 different enzymes revealed a range of RT-signatures, with individual enzymes exhibiting average arrest rates between 20 and 75%, as well as average misincorporation rates between 30 and 75% in the read-through cDNA. Using RT-signatures from individual enzymes to train a random forest model as a machine learning regimen for prediction of modifications, we found strongly variegated success rates for the prediction of methylated purines, as exemplified with N1-methyladenosine (m1A). Among the 13 enzymes, a correlation was found between read length, misincorporation, and prediction success. Inversely, low average read length was correlated to high arrest rate and lower prediction success. The three most successful polymerases were then applied to the characterization of RT-signatures of other methylated purines. Guanosines featuring methyl groups on the Watson-Crick face were identified with high confidence, but discrimination between m1G and m22G was only partially successful. In summary, the results suggest that, given sufficient coverage and a set of specifically optimized reaction conditions for reverse transcription, all RNA modifications that impede Watson-Crick bonds can be distinguished by their RT-signature.


Subject(s)
RNA-Directed DNA Polymerase/metabolism , Reverse Transcription , Adenosine/analogs & derivatives , Guanosine/chemistry , Guanosine/metabolism , Machine Learning , Methylation , Oligoribonucleotides/chemistry , Transcriptome
14.
Front Genet ; 10: 876, 2019.
Article in English | MEDLINE | ID: mdl-31608115

ABSTRACT

Modification mapping from cDNA data has become a tremendously important approach in epitranscriptomics. So-called reverse transcription signatures in cDNA contain information on the position and nature of their causative RNA modifications. Data mining of, e.g. Illumina-based high-throughput sequencing data, is therefore fast growing in importance, and the field is still lacking effective tools. Here we present a versatile user-friendly graphical workflow system for modification calling based on machine learning. The workflow commences with a principal module for trimming, mapping, and postprocessing. The latter includes a quantification of mismatch and arrest rates with single-nucleotide resolution across the mapped transcriptome. Further downstream modules include tools for visualization, machine learning, and modification calling. From the machine-learning module, quality assessment parameters are provided to gauge the suitability of the initial dataset for effective machine learning and modification calling. This output is useful to improve the experimental parameters for library preparation and sequencing. In summary, the automation of the bioinformatics workflow allows a faster turnaround of the optimization cycles in modification calling.

15.
Sci Rep ; 9(1): 6313, 2019 04 19.
Article in English | MEDLINE | ID: mdl-31004088

ABSTRACT

Many archeologists are skeptical about the capabilities of use-wear analysis to infer the function of archeological tools, mainly because the method is seen as subjective, not standardized and not reproducible. Quantitative methods in particular have been developed and applied to address these issues. However, the importance of equipment, acquisition and analysis settings remains underestimated. One of those settings, the numerical aperture of the objective, has the potential to be one of the major factors leading to reproducibility issues. Here, experimental flint and quartzite tools were imaged using laser-scanning confocal microscopy with two objectives having the same magnification but different numerical apertures. The results demonstrate that 3D surface texture ISO 25178 parameters differ significantly when the same surface is measured with objectives having different numerical apertures. It is, however, unknown whether this property would blur or mask information related to use of the tools. Other acquisition and analysis settings are also discussed. We argue that to move use-wear analysis toward standardization, repeatability and reproducibility, the first step is to report all acquisition and analysis settings. This will allow the reproduction of use-wear studies, as well as tracing the differences between studies to given settings.

16.
Sci Rep ; 7(1): 14937, 2017 11 02.
Article in English | MEDLINE | ID: mdl-29097782

ABSTRACT

Head and neck cancer (HNC) is the seventh most common malignancy in the world and its prevailing form, the head and neck squamous cell carcinoma (HNSCC), is characterized as an aggressive and invasive cancer type. The transcription factor II A (TFIIA), initially described as a general regulator of RNA polymerase II-dependent transcription, is part of complex transcriptional networks also controlling mammalian head morphogenesis. Posttranslational cleavage of the TFIIA precursor by the oncologically relevant protease Taspase1 is crucial in this process. In contrast, the relevance of Taspase1-mediated TFIIA cleavage during oncogenesis of HNSCC has not yet been characterized. Here, we performed genome-wide expression profiling of HNSCC which revealed significant downregulation of the TFIIA downstream target CDKN2A. To identify potential regulatory mechanisms of TFIIA at the cellular level, we characterized nuclear-cytoplasmic transport and Taspase1-mediated cleavage of TFIIA variants. Unexpectedly, we identified an evolutionarily conserved nuclear export signal (NES) counteracting nuclear localization and thus, transcriptional activity of TFIIA. Notably, proteolytic processing of TFIIA by Taspase1 was found to mask the NES, thereby promoting nuclear localization and transcriptional activation of TFIIA target genes, such as CDKN2A. Collectively, we here describe a hitherto unknown mechanism by which cellular localization and Taspase1 cleavage fine-tune transcriptional activity of TFIIA in HNSCC.


Subject(s)
Endopeptidases/metabolism , Head and Neck Neoplasms/metabolism , Squamous Cell Carcinoma of Head and Neck/metabolism , Transcription Factor TFIIA/metabolism , Cell Line, Tumor , Cyclin-Dependent Kinase Inhibitor p16/genetics , Down-Regulation , Endopeptidases/genetics , Gene Expression Regulation, Neoplastic , Head and Neck Neoplasms/genetics , Humans , Proteolysis , Signal Transduction , Squamous Cell Carcinoma of Head and Neck/genetics
17.
Bioinformatics ; 33(23): 3740-3748, 2017 Dec 01.
Article in English | MEDLINE | ID: mdl-28961782

ABSTRACT

MOTIVATION: Metagenomic shotgun sequencing studies are becoming increasingly popular with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, corresponding software tools suffer from either long runtimes, large memory requirements or low accuracy. RESULTS: We introduce MetaCache, novel software for read classification using the big data technique minhashing. Our approach performs context-aware classification of reads by computing representative subsamples of k-mers within both probed reads and locally constrained regions of the reference genomes. As a result, MetaCache consumes significantly less memory compared to the state-of-the-art read classifiers Kraken and CLARK while achieving highly competitive sensitivity and precision at comparable speed. For example, using NCBI RefSeq draft and completed genomes with a total length of around 140 billion bases as reference, MetaCache's database consumes only 62 GB of memory while both Kraken and CLARK fail to construct their respective databases on a workstation with 512 GB RAM. Our experimental results further show that classification accuracy continuously improves when increasing the amount of utilized reference genome data. AVAILABILITY AND IMPLEMENTATION: MetaCache is open source software written in C++ and can be downloaded at http://github.com/muellan/metacache. CONTACT: bertil.schmidt@uni-mainz.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Metagenomics/methods , Software , Algorithms , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA
18.
Drug Discov Today ; 22(4): 712-717, 2017 04.
Article in English | MEDLINE | ID: mdl-28163155

ABSTRACT

The progress of next-generation sequencing has a major impact on medical and genomic research. This high-throughput technology can now produce billions of short DNA or RNA fragments in excess of a few terabytes of data in a single run. This leads to massive datasets used by a wide range of applications including personalized cancer treatment and precision medicine. In addition to the hugely increased throughput, the cost of using high-throughput technologies has been dramatically decreasing. A low sequencing cost of around US$1000 per genome has now rendered large population-scale projects feasible. However, to make effective use of the produced data, the design of big data algorithms and their efficient implementation on modern high performance computing systems is required.


Subject(s)
Genome/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Algorithms , Computing Methodologies , Databases, Genetic , Genomics/economics , Genomics/methods , High-Throughput Nucleotide Sequencing/economics , Humans , Sequence Analysis, DNA/economics
19.
Biomolecules ; 6(4)2016 11 10.
Article in English | MEDLINE | ID: mdl-27834909

ABSTRACT

Combination of reverse transcription (RT) and deep sequencing has emerged as a powerful instrument for the detection of RNA modifications, a field that has seen a recent surge in activity because of its importance in gene regulation. Recent studies yielded high-resolution RT signatures of modified ribonucleotides relying on both sequence-dependent mismatch patterns and reverse transcription arrests. Common alignment viewers lack specialized functionality, such as filtering, tailored visualization, image export and differential analysis. Consequently, the community will profit from a platform seamlessly connecting detailed visual inspection of RT signatures and automated screening for modification candidates. CoverageAnalyzer (CAn) was developed in response to the demand for a powerful inspection tool. It is freely available for all three main operating systems. With SAM file format as standard input, CAn is an intuitive and user-friendly tool that is generally applicable to the large community of biomedical users, starting from simple visualization of RNA sequencing (RNA-Seq) data, up to sophisticated modification analysis with significance-based modification candidate calling.


Subject(s)
Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Computational Biology/methods , High-Throughput Nucleotide Sequencing , Software , User-Computer Interface
20.
BMC Bioinformatics ; 17(1): 394, 2016 Sep 23.
Article in English | MEDLINE | ID: mdl-27663265

ABSTRACT

BACKGROUND: Gene Set Enrichment Analysis (GSEA) is a popular method to reveal significant dependencies between predefined sets of gene symbols and observed phenotypes by evaluating the deviation of gene expression values between cases and controls. An established measure of inter-class deviation, the enrichment score, is usually computed using a weighted running sum statistic over the whole set of gene symbols. Due to the lack of analytic expressions the significance of enrichment scores is determined using a non-parametric estimation of their null distribution by permuting the phenotype labels of the probed patients. Accordingly, GSEA is a time-consuming task due to the large number of required permutations to accurately estimate the nominal p-value - a circumstance that is even more pronounced during multiple hypothesis testing since its estimate is lower-bounded by the inverse number of samples in permutation space. RESULTS: We present rapidGSEA - a software suite consisting of two tools for facilitating permutation-based GSEA: cudaGSEA and ompGSEA. cudaGSEA is a CUDA-accelerated tool using fine-grained parallelization schemes on massively parallel architectures while ompGSEA is a coarse-grained multi-threaded tool for multi-core CPUs. Nominal p-value estimation of 4,725 gene sets on a data set consisting of 20,639 unique gene symbols and 200 patients (183 cases + 17 controls) each probing one million permutations takes 19 hours on a Xeon CPU and less than one hour on a GeForce Titan X GPU while the established GSEA tool from the Broad Institute (broadGSEA) takes roughly 13 days. CONCLUSION: cudaGSEA outperforms broadGSEA by around two orders-of-magnitude on a single Tesla K40c or GeForce Titan X GPU. ompGSEA provides around one order-of-magnitude speedup to broadGSEA on a standard Xeon CPU. The rapidGSEA suite is open-source software and can be downloaded at https://github.com/gravitino/cudaGSEA as standalone application or package for the R framework.
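
The weighted running sum behind the enrichment score can be sketched as follows. This is a plain single-threaded illustration of the commonly described statistic, not code from rapidGSEA; the function name, argument layout, and default weight are assumptions, and the sketch requires that the gene set contains at least one but not all of the ranked genes and that its members have nonzero scores.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Maximum deviation from zero of the weighted running-sum statistic.

    ranked_genes: gene symbols sorted by decreasing ranking metric.
    scores: ranking metric of each gene, in the same order.
    gene_set: set of gene symbols to test (assumed non-empty, not all genes).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    # Normalizer so that all hit increments together sum to 1.
    hit_norm = sum(abs(s) ** weight for s, h in zip(scores, hits) if h)
    # Each miss lowers the running sum by the same amount.
    miss_step = 1.0 / (len(ranked_genes) - sum(hits))

    running, best = 0.0, 0.0
    for score, is_hit in zip(scores, hits):
        running += abs(score) ** weight / hit_norm if is_hit else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


genes = ["g1", "g2", "g3", "g4"]
metric = [4.0, 3.0, 2.0, 1.0]
print(enrichment_score(genes, metric, {"g1", "g2"}))  # set at the top of the ranking
print(enrichment_score(genes, metric, {"g3", "g4"}))  # set at the bottom of the ranking
```

In a permutation test this function would be re-evaluated once per shuffled phenotype labelling and gene set. With n permutations the smallest nonzero p-value estimate is 1/n, which is why very large permutation counts, and hence parallel implementations, become necessary.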
